Abstract

Problem Statement: There have been many attempts to research the effective
assessment of writing ability, and many proposals for how this might be
done. Rater reliability plays a crucial role in the vital decisions made
about testees at different turning points of both educational and
professional life. The intra-rater and inter-rater reliability of essay
assessments made with different assessment tools should therefore be
discussed together with the assessment processes.

Purpose of Study: The purpose of the study is to reveal possible variation or 
consistency in grading essay writing ability of EFL writers by the 
same/ different raters using general impression marking (GIM), essay criteria 
checklist (ECC), and essay assessment scale (ESAS), and discuss rater 
reliability. 

Methods: Quantitative and qualitative data were used to discuss the
reliability of ratings and the consistency of the measurement results, and to
draw implications. The assessment tools were applied to the essays of 44 EFL
university students, and 10 raters assessed the students' essay writing
ability by using GIM, ECC, and ESAS on different occasions.

Findings and Results: The analyses indicated that general impression marking
is evidently not reliable for assessing essays. The correlation coefficients,
estimated variance components, and generalizability coefficients obtained from
the checklist and scale assessments provide valuable information and clearly
show that there is always variation among the results.

Conclusions and Recommendations: When the total scores and the rater
consensus results in this study are examined, it can be clearly seen that the
scores are almost never identical. For this reason, contrary to the commonly
held view, checklists or even scales may not be as reliable as expected, and
they may not improve the inter-rater or intra-rater reliability of ratings
unless the raters are very well trained and share strong agreement or common
inferences about the performance indicators and descriptors, so that the
criteria set is not interpreted ambiguously. The results might be more
accurate and reliable if the accepted threshold for a meaningful correlation
coefficient in this kind of measurement were raised to a minimum of .90 as
evidence of reliable ratings. This might mean that the scores assigned to the
same or independent essays would be closer and more similar. The scale could
still be emphasized as the more reliable tool. Even so, an elaborate and
careful examination with more raters is needed.

Keywords: Essay, assessment, intra-rater, inter-rater, reliability. 


Assessing writing ability and the reliability of ratings have been a challenging
concern for decades. There is always variation in the elements of writing
preferred by raters, and extraneous factors cause further variation (Blok, 1985;
Chase, 1968; Chase, 1983; Darus, 2006; East, 2009; Engelhard, 1994; Gyagenda & 
Engelhard, 1998a; Gyagenda & Engelhard, 1998b; Hughes, Keeling & Tuck, 1980; 
Hughes, Keeling & Tuck, 1983; Hughes & Keeling, 1984; Kan, 2005; Klein & Hart, 
1968; Klein & Taub, 2005; Marshall & Powers, 1969; Murphy & Balzer, 1989; Schaefer, 
2008; Slomp, 2012; Sulsky & Balzer, 1988; Wexley & Youtz, 1985; Woehr & Huffcutt, 
1994). Fisher, Brooks, and Lewis (2002) state that the fitness-for-purpose
requirement is the core of all testing work, and that direct writing assessments are
subjective and thereby more prone to reliability issues. For this reason, many raters
use scoring scales or rubrics because they believe that any assessment without a
scale is based on subjective judgments and general impression. Some researchers also
state that not only general impression marking but also holistic assessment with a
set of criteria can be highly subjective (Hamp-Lyons, 1991; Vaughan, 1991) and that
scores can vary significantly. Huot (1990) states that the levels of reliability
achieved with holistic assessment are generally lower than those achieved with
analytic assessment (Johnson, Penny, & Gordon, 2001). In this respect, general
impression marking and holistic assessment can be called subjective, whereas
analytic assessment can be called objective-like or systematically subjective
because, all in all, each indicator of the criteria is still scored subjectively
(Kayapinar, 2010). Even if analytic assessment seems more reliable than the others,
every type of assessment still rests on a set of criteria, whether implicit or
explicit. Moreover, a comparison of reliability measures across different assessment
tools is necessary in order to provide evidence that goes beyond mere claims about
whether essays are assessed consistently, because the rating method used by the
raters, holistic or analytic, can change their application of the rating criteria
(Huang, 2012).

In this article, general impression marking refers to treating an essay as a
whole with a subjective judgment (Hamp-Lyons, 1992). For this reason, no tool was
provided for this type of assessment in the study. Holistic assessment refers to
scoring the overall product as a whole, without judging the predetermined component
parts separately (Mertler, 2001; Nitko, 2001). For this type of assessment, a checklist
entitled Essay Criteria Checklist (App. 1) was employed. A rating scale entitled Essay
Assessment Scale (App. 2) was used for analytic assessment, which refers to scoring
the levels of the product against individual predetermined criteria and obtaining a
total score as the sum of the individual scores (Moskal, 2000; Nitko, 2001; Weir, 1990).

Considering the measures of rater reliability and the carry-over effect, the basic
research question guiding the study is as follows:

Is there any variation in the intra-rater and inter-rater reliability of the writing
scores assigned to EFL essays by using general impression marking, holistic scoring,
and analytic scoring?

Method 


Sample 

Three study groups were randomly chosen and employed as follows: Judges.
The judges (n = 103) were faculty members of ELT departments at 20 different
universities. They evaluated the appropriateness and validity of the checklist items
(App. 1) and the criteria and performance indicators of the scale (App. 2). Raters.
The raters (n = 10) who assessed the essays were ELT experts (MAs and PhDs) and
experienced teachers of writing (at least 2 years). EFL students. The students
(n = 44) who took the essay test produced the essays under testing conditions for an
Advanced Reading and Writing class.

Research Instruments 

The writing samples. Forty-four scripts of one essay task were written under testing
conditions in order to achieve the following objective:

"By means of the awareness of essay types, essay writers will analyze, synthesize
and evaluate information and therefore, in their compositions, react to prompts.
Essay writers will also be able to analyze and produce different types of essays (e.g.
comparison and contrast, classification, process analysis, cause-and-effect analysis,
and argumentative) that are unified, coherent, and organized." The essay prompt,
which was produced by the teachers of the particular class, was the same for all
students: Please write an essay about the topic "University students should be
free to choose their own courses."

Essay Criteria Checklist (ECC). The checklist was developed in order to measure
each construct of essay writing. First, a list of criteria was drawn up through a
review of the relevant literature (Raimes, 1983; Norton, 1990; Celce-Murcia, 2001;
Johnson, Penny, & Gordon, 2001; Jacobs et al. 1981 in Weigle, 2002; Weigle, 2002;
Bowen and Cali, 2004; Hawkey & Barker, 2004; Darus, 2006; IELTS, 2007; Dempsey,
PytlikZillig, & Bruning, 2009; Knoch, 2009). Next, 103 faculty members from ELT
departments at 20 different universities examined the appropriateness of the checklist,
considering the expressions used and the consistency between the objectives and
constructs of essay writing skill and the checklist items. The ratio of agreement (P)
(Erkuş, 2003) was found to be high (P = 96.1; P = the number of judges who agreed
on each criterion / the total number of judges). Finally, two experts in measurement
and evaluation examined the checklist with regard to its content and technical features.
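
The ratio of agreement defined above can be illustrated with a minimal sketch. The function name and the count of 99 agreeing judges are assumptions made for illustration only; the study reports the ratio (96.1) but not the underlying counts, and 99 of 103 is simply one count that yields that value.

```python
def agreement_ratio(n_agreed: int, n_judges: int) -> float:
    """Percentage of judges who agreed on a criterion (P in the text)."""
    if n_judges <= 0:
        raise ValueError("n_judges must be positive")
    if not 0 <= n_agreed <= n_judges:
        raise ValueError("n_agreed must lie between 0 and n_judges")
    return 100.0 * n_agreed / n_judges


# Hypothetical count: 99 of the 103 judges agree on a criterion.
p = agreement_ratio(99, 103)
print(round(p, 1))
```

A ratio computed per criterion in this way could then be averaged over the checklist items; whether the reported figure is such an average is not stated in the text.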

Essay Assessment Scale (ESAS). The scale was developed in order to describe and
measure each construct of essay writing skill with performance levels. First, 103
faculty members of ELT departments at 20 different universities examined the scale,
considering the expressions used and the consistency between the objectives and
constructs of essay writing skill and the performance indicators included. The ratio
of agreement (P) for the scale was also 96.1. Next, two experts in measurement and
evaluation examined the scale with regard to its technical features. Finally, a
Likert-type scale covering five performance levels (0-1-2-3-4) was developed on the
basis of expert judgments. Five performance levels were chosen because they are easy
to use and useful for describing observable behavior, although there is no limit on
the number of performance levels (Kan, 2007).

The measurement results. The total scores from 2640 ratings ((10 raters x 44 essay
scripts) x 6 independent sessions) of the randomly selected essay scripts were used to
measure the reliability of ratings made with GIM, ECC, and ESAS.

Standardized open-ended interviews. The raters were asked the following 
standardized open-ended interview questions about the assessment process: 

1. "What do you think of the assessments you made by using GIM?" 

2. "What do you think of the assessments you made by using ECC?" 

3. "What do you think of the assessments you made by using ESAS?" 

A pretest of the interview questions was carried out by two independent raters 
and two experts of measurement and evaluation in order to identify the validity and 
the effectiveness of the questions. 

Procedure 

The procedure of the study includes two phases: The production of the material to
be scored. The essays were produced under the testing conditions of an advanced
reading and writing class. After the names had been deleted, each essay was given a
different, randomly assigned code for each rating.

Assessment Design. There were ten raters and six different rating sessions in the
study. Before each rating session, the raters were given a short training session
and instructions for the proper completion of the session. Each rater scored the
essays one at a time, 44 essays in one batch and 264 essays in total. Each
rating session was held after a 10-week break in order to remove the carry-over effect
of the previous assessment. In order to preserve objectivity, the order and the
numbering of the essays were changed before each session and the essays were assigned
random codes.
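
The re-coding and re-ordering of the essays before each session can be sketched as follows. This is only an illustration under stated assumptions: the function name, the four-digit code format, and the use of a seed are hypothetical and are not described in the study.

```python
import random


def assign_session_codes(essay_ids, session, seed=None):
    """Return (code, essay_id) pairs in a freshly shuffled presentation order.

    Each essay receives a new random code for the given session, so a rater
    cannot recognise an essay from its earlier code or position.
    """
    rng = random.Random(seed)
    order = list(essay_ids)
    rng.shuffle(order)  # new presentation order for this session
    codes = rng.sample(range(1000, 10000), len(order))  # unique 4-digit codes
    return [(f"S{session}-{code}", essay_id) for code, essay_id in zip(codes, order)]


# Hypothetical use: 44 essays, first of the six rating sessions.
session_1 = assign_session_codes(range(1, 45), session=1, seed=0)
print(session_1[:3])
```

The mapping from codes back to essays would be kept by the researcher only, so that the two ratings of the same essay can be paired afterwards.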

Data Analyses 

In order to determine the intra-rater reliability of the ratings, the correlation
coefficients between the two gradings of the same essays by the same raters were
computed by using Pearson product-moment correlation analysis. The correlation
coefficients were also examined by using Fisher's z transformation to test the
significance of the differences between correlation coefficients. This procedure made
it possible to rank the correlation coefficients. ANOVA was employed in order to present
evidence for the inter-rater reliability of ratings. The differences in the scores across
the task and the raters by using GIM and ESAS were also interpreted through a
generalizability study. A series of person x rater x task analyses was performed to
examine the variation in scores due to the potential effects of person, rater, and task
after the variance components had been estimated. Standardized open-ended
interviews revealed the reflections and views of the raters on their own rating
process. The qualitative data were analyzed line by line and memos were
written (Glesne, 1999; Strauss & Corbin, 1998). Categories were reviewed and
recurring themes, core consistencies, and meanings were identified by using pattern
codes. Those explanatory pattern codes were later grouped into smaller sets and
themes through content analysis (Miles & Huberman, 1994; Patton, 2002). The process
included underlining key terms in the responses, restating key phrases, coding key
terms, pattern coding, constructing themes, and incorporating themes into an
explanatory framework.
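
The intra-rater coefficient described above is the Pearson product-moment correlation between a rater's first and second scores for the same essays. The following is a minimal sketch of that computation; the function name and the sample scores are hypothetical and are not data from the study.

```python
from math import sqrt


def pearson_r(first, second):
    """Pearson product-moment correlation between two lists of scores."""
    if len(first) != len(second) or len(first) < 2:
        raise ValueError("need two score lists of equal length (at least 2)")
    n = len(first)
    mean_1 = sum(first) / n
    mean_2 = sum(second) / n
    # Sums of cross-products and squared deviations from the means.
    sxy = sum((a - mean_1) * (b - mean_2) for a, b in zip(first, second))
    sxx = sum((a - mean_1) ** 2 for a in first)
    syy = sum((b - mean_2) ** 2 for b in second)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined for constant scores")
    return sxy / sqrt(sxx * syy)


# Hypothetical scores given by one rater to five essays in two sessions.
session_a = [62, 75, 48, 90, 70]
session_b = [60, 80, 55, 85, 66]
print(round(pearson_r(session_a, session_b), 3))
```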
Results


Intra-rater reliability. 


Table 1 shows the intra-rater consensus between GIM assessments. 

Table 1

Intra-rater Consensus between GIM Assessments

Difference   R1       R2       R3       R4       R5       R6       R7       R8       R9       R10
             f   %    f   %    f   %    f   %    f   %    f   %    f   %    f   %    f   %    f   %
0            7   16   2   5    6   14   6   14   1   2    1   2    1   2    9   21   7   15   3   7
±1-5         9   21   17  38   8   18   18  41   13  30   30  68   10  23   7   15   1   3    18  41
±6-10        8   18   7   15   9   21   8   18   13  30   12  27   6   14   7   15   7   15   14  32
±11-15       9   21   6   14   8   18   4   9    6   14   0   0    6   14   9   21   2   5    2   5
±15-more     11  25   12  27   13  30   8   18   11  25   1   2    21  47   12  27   12  27   7   15
TOTAL        44  100  44  100  44  100  44  100  44  100  44  100  44  100  44  100  44  100  44  100

R=Rater
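
The difference bands used in Table 1 can be tabulated from a rater's two sets of scores as in the sketch below. It assumes integer scores on the 0-100 scale and places a difference of exactly 15 points in the ±11-15 band, since the printed band labels overlap at 15; the function names and sample scores are hypothetical.

```python
from collections import Counter

BANDS = ["0", "±1-5", "±6-10", "±11-15", "±15-more"]


def band(diff):
    """Map a score difference to one of the consensus bands."""
    d = abs(diff)
    if d == 0:
        return "0"
    if d <= 5:
        return "±1-5"
    if d <= 10:
        return "±6-10"
    if d <= 15:  # assumption: exactly 15 counts as ±11-15
        return "±11-15"
    return "±15-more"


def consensus_table(first, second):
    """Return {band: (frequency, rounded percentage)} for two rating sessions."""
    counts = Counter(band(a - b) for a, b in zip(first, second))
    n = len(first)
    return {b: (counts[b], round(100 * counts[b] / n)) for b in BANDS}


# Hypothetical scores for five essays rated twice by the same rater.
print(consensus_table([50, 60, 70, 80, 90], [50, 63, 80, 95, 60]))
```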


Table 1 shows that Rater 6 scored 31 essays out of 44 with a ±0-5-point difference
on a 0-100 point scale. This is the highest value among the raters and means that 70%
of the essays have similar results in the two assessments made by using GIM. The
assessments of Rater 9 have the lowest percentage of consensus, which is 18% within a
±0-5-point difference: the frequency is 7 for zero difference and 1 for a ±1-5-point
difference. The other raters' consensus between the two GIM assessments has a
frequency range between 11 and 21 essays. Table 1 also indicates that the percentages
of the scores which are the same in the two assessments range between 2 and 21,
which means that the frequencies range between 1 and 9 out of 44 essays. Raters 5, 6,
and 7 each have only one score which is the same in both assessments, whereas Rater 8
scored 9 essays the same. For a better understanding of the rater reliability of general
impression marking, it is necessary to examine the correlation coefficients between
the two assessments made by using GIM. The correlation coefficients, computed by
using Pearson product-moment correlation, are presented below in Table 2:

Table 2

Correlations across GIM Assessments

Rater   r
1       .042
2       .510**
3       .477**
4       .279
5       .450**
6       .835**
7       .584**
8       .412**
9       .790**
10      .880**

** Correlation is significant at the 0.01 level


The correlation coefficients in Table 2 range between .042 and .880. Among
the ten coefficients, two, which belong to Raters 1 and 4, are not
significant. The other correlation coefficients are significant. This may mean that
those raters assigned similar scores to the essays in both assessments. However, only
3 of them are above .70, which refers to a considerably high and meaningful
correlation (Kline, 1986) and a relatively high consistency. In fact, even a coefficient
of .70 seems insufficient for a high level of consistency when the intra-rater
consensus is examined and the results in Tables 1 and 2 are compared carefully. For
example, Rater 10 scored only 3 essays (7%) with no difference and 18 essays (41%)
out of 44 with a ±1-5-point difference in spite of having the highest correlation
coefficient (.880) among the GIM assessments.
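
The gap between a high correlation and low score agreement noted above follows from the fact that Pearson's r is unaffected by adding a constant to one set of scores. The sketch below illustrates this with hypothetical scores in which the second session is uniformly 8 points more lenient; the data and function names are assumptions for illustration, not results from the study.

```python
from statistics import fmean, pstdev


def pearson_r(x, y):
    """Pearson r from the population covariance and standard deviations."""
    mx, my = fmean(x), fmean(y)
    cov = fmean([(a - mx) * (b - my) for a, b in zip(x, y)])
    return cov / (pstdev(x) * pstdev(y))


def exact_agreement(x, y):
    """Share of essays given an identical score in both sessions."""
    return sum(a == b for a, b in zip(x, y)) / len(x)


# Hypothetical scores: the second session is 8 points higher throughout.
first = [40, 55, 60, 72, 85]
second = [score + 8 for score in first]

r = pearson_r(first, second)
agreement = exact_agreement(first, second)
print(f"r = {r:.3f}, exact agreement = {agreement:.0%}")
```

Here the correlation is perfect up to rounding, yet no essay receives the same score twice and every difference falls in the ±6-10 band, which is why the consensus tables are read alongside the coefficients.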
Table 3

Intra-rater Consensus between ECC Assessments

Difference   R1       R2       R3       R4       R5       R6       R7       R8       R9       R10
             f   %    f   %    f   %    f   %    f   %    f   %    f   %    f   %    f   %    f   %
0            1   2    1   2    1   2    3   7    6   14   1   2    5   11   0   0    1   2    2   5
±1-5         9   21   35  80   14  32   14  32   36  82   33  75   9   21   40  91   30  69   21  48
±6-10        12  27   8   8    7   16   14  32   2   5    10  23   9   21   3   7    12  27   17  39
±11-15       6   14   0   0    10  23   11  25   0   0    0   0    11  25   0   0    1   2    4   9
±15-more     14  32   0   0    12  27   2   5    0   0    0   0    36  82   1   2    0   0    0   0
TOTAL        44  100  44  100  44  100  44  100  44  100  44  100  44  100  44  100  44  100  44  100

R=Rater


Table 3 shows that Rater 5 scored 42 essays out of 44 with a ±0-5-point difference
on a 0-100 point scale, although only 6 of these essays were scored with a zero
difference. This is the highest value among the raters and means that 96% of the
essays have close results in the two ECC assessments. The assessments of Rater 1
have the lowest percentage of consensus, which is 23% within a ±0-5-point difference:
the frequency is 1 for zero difference and 9 for a ±1-5-point difference. The other
raters' consensus between the two ECC assessments has a frequency range between 15
and 40 essays. Table 3 also indicates that the percentages of the scores which are the
same in the two assessments range between 2 and 14, which means that the
frequencies range between 1 and 6 out of 44 essays. Rater 8 has no score which is the
same in both assessments, and Raters 1, 2, 3, 6, and 9 each have only one such score,
whereas Rater 5 scored 6 essays the same. For a better understanding, it is necessary
to examine the correlation coefficients between the two assessments made by using
ECC. The correlation coefficients, computed by using Pearson product-moment
correlation, are presented below in Table 4:

Table 4

Correlations across ECC Assessments

Rater   r
1       .072
2       .953**
3       .517**
4       .457
5       .955**
6       .898**
7       .730**
8       .932**
9       .928**
10      .804**

** Correlation is significant at the 0.01 level

In Table 4, the correlation coefficients range between .072 and .932, which is
relatively higher than the correlation coefficients across the GIM assessments. Among
the ten coefficients, only one, which belongs to the scores assigned by
Rater 1, is not significant. The other correlation coefficients are significant. This may
mean that those raters gave similar scores to the essays in both assessments.
Moreover, 7 of them are above .70, which refers to a high and meaningful correlation
coefficient and a relatively high consistency (Kline, 1986). Table 5 below shows the
intra-rater consensus between the ESAS assessments:


Table 5

Intra-rater Consensus between ESAS Assessments

Difference   R1       R2       R3       R4       R5       R6       R7       R8       R9       R10
             f   %    f   %    f   %    f   %    f   %    f   %    f   %    f   %    f   %    f   %
0            9   21   5   11   3   7    3   7    6   14   2   5    2   5    6   14   3   7    3   7
±1-5         18  41   21  48   18  41   3   7    28  64   18  41   24  55   24  55   23  53   24  55
±6-10        8   18   4   9    11  25   6   14   10  23   7   16   8   18   12  28   12  28   10  23
±11-15       4   9    14  32   7   16   7   16   0   0    6   14   2   5    2   5    4   9    6   14
±15-more     5   11   0   0    5   11   25  57   0   0    11  25   8   18   0   0    2   5    1   2
TOTAL        44  100  44  100  44  100  44  100  44  100  44  100  44  100  44  100  44  100  44  100

R=Rater


Table 5 shows that Rater 1 scored 27 essays out of 44 with a ±0-5-point difference
on a 0-100 point scale. This means 62% of the essays have similar results in the two
assessments made by using ESAS. In the assessments of Rater 2, the number of
essays scored with a ±0-5-point difference is 26, and the percentage is 59%. Rater 3
scored 21 essays with a ±0-5-point difference, which means 48%. Rater 4 is the one who
has the smallest amount of consistency: this rater scored only 6 essays with a
±0-5-point difference, which refers to 14%. In the assessments of Rater 5, the number
of essays scored with a ±0-5-point difference is 34, which is quite high (78%) when
compared to the others. The results of Rater 6 show that 20 essays were scored with a
±0-5-point difference on a 0-100 point scale. Rater 7 scored only 2 essays the same,
but there are 26 essays scored with a ±0-5-point difference. The assessments of Rater 8
indicate that 30 essays have a ±0-5-point difference, which refers to 69%. In the
assessments made by Rater 9, the number of essays with a ±0-5-point difference is 26.
Finally, Rater 10 scored 27 essays with a ±0-5-point difference, a percentage of 62.
For a better understanding of the rater reliability of the scale, it is necessary to
examine the correlation coefficients between the two assessments made by using ESAS.
The correlation coefficients between the first and the second assessments, computed by
using Pearson product-moment correlation, are presented below in Table 6:


Table 6

Correlations across ESAS Assessments

Rater   r
1       .757**
2       .641**
3       .585**
4       .021
5       .825**
6       .680**
7       .545**
8       .916**
9       .811**
10      .884**

** Correlation is significant at the 0.01 level


The results indicate that the correlation coefficients between the scores the raters
assigned to the essays are high and significant at the 0.01 level (no less than
.545), except the one for Rater 4 (.021). These results indicate that 9 raters
scored the essays in a significantly reliable way. Moreover, 7 of the correlation
coefficients are around or above .70. This is a high level of positive correlation,
which is meaningful and which might mean that there is a high consistency between the
assessments (Kline, 1986). When the results are compared with one another, Rater 4 is
the one who has the smallest amount of intra-rater consistency and, correspondingly,
the one whose results have the lowest and the only insignificant correlation
coefficient. The highest correlation coefficient belongs to Rater 8 (.916), whose
scores correspond closely to each other. This indicates similar results for two
assessments made at different times. Moreover, Rater 8 is the one who scored 42 essays
out of 44 within a ±10-point difference on a 0-100 point scale (intra-rater
consensus = 95%). This is the best result among the raters' assessments; however, the
differences among the correlation coefficients, and even the differences within 10
points in the total scores of the same essays scored at different times, indicate that
there is always a source of variation in assessments made by using ESAS.


Table 7 

The Comparisons among Correlation Coefficients across Different Assessments 

Raters

1 2 3 4 5 6 7 8 9 10

r12 - r34

r34 - r56


0.056 


1.772 


1.648 


p<.05 

2.433 

0.369 

1.572 


0.099 


0.282 


0.176 


0.016 


p<.05 

2.992 


0.481 


0.867 1.657 0.702 


0.849 1.282 1.137 


0.487 2.311 1.071

0.107 0.106 0.109 

0.141 0.205 0.961 


0.498 


0.034 


0.531 


In the table showing Fisher's z transformation, r12 refers to the correlation
coefficient between the first two ratings; r34 refers to the correlation coefficient
between the following two ratings; and r56 refers to the correlation coefficient
between the final two ratings. The differences that are significant (p<0.05) are
marked in the table. The results indicate that few raters (2, 5, and 8) made
consistent and decisive assessments at different times. As seen in the table,
no other raters made consistent and decisive assessments when using the same
tools at different times. This may mean that raters assign different scores to the
same essays at different times.
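
A comparison of two correlation coefficients through Fisher's z transformation can be sketched as follows. This is the textbook test for two independent correlations; the study does not state the exact formula it used, and because the same essays underlie every coefficient here, treating the coefficients as independent is an assumption of the sketch. The function name is hypothetical; the two example coefficients are Rater 8's values from Tables 2 and 6, with n = 44 essays.

```python
from math import atanh, erfc, sqrt


def fisher_z_difference(r1, n1, r2, n2):
    """Test statistic and two-sided p-value for the difference of two correlations.

    Each r is transformed with z = atanh(r); the difference is divided by
    its standard error sqrt(1/(n1 - 3) + 1/(n2 - 3)).
    """
    z1, z2 = atanh(r1), atanh(r2)
    se = sqrt(1.0 / (n1 - 3) + 1.0 / (n2 - 3))
    z = (z1 - z2) / se
    p = erfc(abs(z) / sqrt(2.0))  # two-sided normal tail probability
    return z, p


# Rater 8: GIM r = .412 (Table 2) versus ESAS r = .916 (Table 6), 44 essays each.
z, p = fisher_z_difference(0.412, 44, 0.916, 44)
print(f"z = {z:.3f}, p = {p:.4f}")
```

An absolute z value above 1.96 corresponds to a difference that is significant at the .05 level under these assumptions.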

Inter-rater reliability 

An analysis of variance was conducted to examine the inter-rater consensus
statistically. The results are given in the table below:

Table 8

Inter-rater Reliability of Assessments

Rating   Sum of Squares   df   Mean Square   F        Sig.
1        17554.036        9    1950.448      11.052   .000
2        21461.411        9    2384.601      8.913    .000
3        22407.909        9    2489.768      13.465   .000
4        20462.684        9    2273.632      10.164   .000
5        17570.475        9    1952.275      15.781   .000
6        31722.773        9    3524.753      31.983   .000

p<0.001

The table shows the output of the ANOVA and whether there is a
statistically significant difference between group means. The results
indicate that the mean scores the raters assigned to the
essays differ significantly from each other. The reported significance level
is .000, which is below 0.001 (p<0.001). Therefore, there is a statistically
significant difference in the mean scores assigned by different raters. This might
mean that there are remarkable differences among the scores assigned by the raters to
the same essays and that the inter-rater reliability of the assessments is
considerably low.
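
The quantities reported in Table 8 follow the usual one-way ANOVA layout, in which each mean square is a sum of squares divided by its degrees of freedom (for rating 1, 17554.036 / 9 ≈ 1950.448) and F is the ratio of the between-group to the within-group mean square. The sketch below assumes a one-way design with each rater's scores as one group; the function name and the sample data are hypothetical, and the exact ANOVA specification used in the study is not stated.

```python
def one_way_anova(groups):
    """One-way ANOVA for a list of groups (one list of scores per rater)."""
    k = len(groups)
    n_total = sum(len(g) for g in groups)
    grand_mean = sum(sum(g) for g in groups) / n_total
    means = [sum(g) / len(g) for g in groups]

    # Between-group and within-group sums of squares.
    ss_between = sum(len(g) * (m - grand_mean) ** 2 for g, m in zip(groups, means))
    ss_within = sum((x - m) ** 2 for g, m in zip(groups, means) for x in g)

    df_between, df_within = k - 1, n_total - k
    ms_between = ss_between / df_between
    ms_within = ss_within / df_within
    return {
        "ss_between": ss_between,
        "df_between": df_between,
        "ms_between": ms_between,
        "F": ms_between / ms_within,
    }


# Hypothetical scores from three raters for the same three essays.
result = one_way_anova([[1, 2, 3], [2, 3, 4], [5, 6, 7]])
print(result)
```

With ten raters the between-group degrees of freedom are 10 - 1 = 9, which matches the df column of Table 8.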

A series of random one-facet (student x rater) and random two-facet
(student x task x rater) generalizability studies was performed for each rating
method (GIM and ESAS). The analysis could not be carried out for the ECC ratings
because of data loss. In addition, the generalizability study could be conducted for
only 9 raters, as one of the raters was not able to provide the data for it. The
estimated variance components for the ratings are given in Table 9 below:

Table 9

Estimated Variance Components (EVC) for GIM and ESAS Ratings

Source                         n    GIM                         ESAS
                                    EVC      Total Variance %   EVC      Total Variance %
Student                        44   0.258    0.87               0.547    1.39
Task                           2    2.241    7.52               3.023    7.69
Rater                          9    20.215   67.86              25.951   66.05
Student x Rater                     1.429    4.80               2.255    5.74
Student x Task                      0.207    0.69               0.317    0.81
Task x Rater                        1.989    6.68               2.556    6.51
Student x Task x Rater              3.452    11.59              4.642    11.81
Generalizability Coefficient        0.26                        0.57



In Table 9, the universe score variance (student) increased from 0.87% with GIM to
1.39% with ESAS, which reflects a slight difference between the two. The rater (r)
effect was reduced from 67.86% to 66.05%, and the s x r interaction increased from
4.80% to 5.74%. The s x t interaction, which reflects differences in examinees'
performance across tasks, accounted for 0.69% of the variance with GIM and 0.81% with
ESAS. Besides, the t x r interaction was reduced from 6.68% to 6.51% when the raters
assigned scores by using ESAS, and the s x t x r interaction increased from 11.59% to
11.81%. A considerably higher generalizability coefficient was obtained when the
scores were assigned by using the scale (0.57 versus 0.26). This might mean that, in
terms of inter-rater reliability, using the scale to assign scores to the task is more
effective and advantageous for revealing the differences in the quality of students'
responses.
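
The percentage column of Table 9 is each estimated variance component divided by the sum of all components, which the sketch below reproduces for the GIM column. The second function shows the standard relative generalizability coefficient for a fully crossed student x task x rater design; this formula is an assumption, since the study does not state how its coefficients were computed, and the value it returns need not match the coefficients reported in Table 9. The dictionary keys and function names are hypothetical.

```python
# Estimated variance components for GIM, taken from Table 9.
GIM = {"s": 0.258, "t": 2.241, "r": 20.215,
       "sr": 1.429, "st": 0.207, "tr": 1.989, "str": 3.452}


def percent_of_total(evc):
    """Share of the total variance attributed to each source, in percent."""
    total = sum(evc.values())
    return {source: round(100 * value / total, 2) for source, value in evc.items()}


def relative_g(evc, n_tasks, n_raters):
    """Relative G coefficient for a crossed s x t x r design (assumed formula).

    Universe score variance divided by itself plus the relative error
    variance: st / n_t + sr / n_r + str / (n_t * n_r).
    """
    rel_error = (evc["st"] / n_tasks
                 + evc["sr"] / n_raters
                 + evc["str"] / (n_tasks * n_raters))
    return evc["s"] / (evc["s"] + rel_error)


print(percent_of_total(GIM))
print(round(relative_g(GIM, n_tasks=2, n_raters=9), 3))
```

The large share attributed to raters in the percentage output is what drives the low coefficients: most of the observed score variance comes from who rated the essay rather than from who wrote it.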

Standardized Open-ended Questioning 

Standardized open-ended questioning was employed to collect
the qualitative data in order to reveal the views of the raters on the assessment
processes and the types of assessment. It involves asking the same questions, the
same stimuli, in the same way determined in advance (Patton, 2002). The transcripts
were analyzed line by line and memos were written (Strauss & Corbin, 1998; Glesne,
1999). Categories or labels were reviewed and recurring themes, core consistencies,
and meanings were identified by using pattern codes (Miles & Huberman, 1994; Patton,
2002). The themes were found to be a) criteria use, b) spelling, and c) weightings.

What is immediately apparent from the open-ended transcripts is that criteria use
is considered very important and useful in essay assessment: the raters mention that
they were more precise and that the results were more consistent when they assessed
the essays by using the criteria given. One of the raters states that GIM assessment
was like gambling because the raters needed to assign a total score to each essay
without any written or pre-specified criteria. The raters also state that criteria use
changed the tendency to score subjectively in a positive manner. In this respect, the
raters seem to share the idea that assessment by using a checklist or a scale is always
more objective and reliable. Some raters state that there should be a criterion for
spelling: even if the testees are advanced-level writers, they might make spelling
mistakes, and the raters cannot score spelling because it is not one of the criteria in
the scale. The spelling criterion had not been found appropriate by the judges because
the task is at an advanced level. Although the raters seemed to agree
that GIM assessments were not reliable and consistent, they also criticized the ESAS
weightings. They state that the weightings should not be equal for each sub-criterion.
For example, one of the raters says it would be better if the weighting was different
for each sub-criterion, as this would be more useful and consistent. Considering the
transcripts, it can be stated that criteria use is regarded as a reliable
and agreed measure for assessing essays. However, the criteria should be chosen
precisely and correctly considering the needs of the students, and the weightings of
the criteria should be independent from each other. In fact, the weightings are
different for each criterion, but this particular rater seems to think that equal
weightings are used for each criterion.


Discussion and Conclusions 

The study gives evidence that all methods, techniques, or tools could include 
subjectivity and it seems reasonable to notice that mental processes and internal 
responses of raters function in different ways in using same assessment criteria for 
the same essays in different times. The statistical evidence indicates that GIM 
assessments are never consistent and reliable. The statistical analyses clearly show 
that ECC assessments are more reliable and consistent than GIM ones. The 
correlation coefficients are higher and they are supported by the raters themselves, as 
seen in qualitative data. The results also show that ESAS assessments are also 
consistent and reliable when compared to GIM. However, there is a slight difference 
between the correlation coefficients across ECC assessments and ESAS assessments. 
Yet, the coefficients across ESAS assessments are slightly higher and more 
meaningful than the ones across ECC assessments. This slight difference can also be 
observed by examining the intra-rater consensus between the assessments. It seems 
different weightings for each sub-criterion may result in more consistent assessments 
as raters declared because the results of the difference of correlation coefficients 
which were obtained by using Fischer's z transformation also support the idea that 
the intra-rater scores are similar but not the same. Paired comparisons with ANOVA 
indicate that the inter-rater scores are never meaningfully similar. This means that 
different scores are assigned to the same essays at different points in time. Clearly, 
if the same essay receives a lower score in one of two sessions and the scores fall 
around the cut-off score, success or failure depends on a source of variation. At this 
point, the raters and the time elapsed between assessments appear to be the sources of 
variation. The G coefficients also indicate that assigning scores is more precise and 
effective when the scale is used, as it increases inter-rater reliability. Considering 
several limitations, further research into the effectiveness and usefulness of the scale 
would be valuable, as it is difficult to infer what processes the raters experience 
while they are scoring essays. The more pieces of information available, the 
more reliable the conclusions drawn from the data will be (Cherry & Meyer, 1993). 
However, when the total scores and the rater consensus results are examined, it can 
be clearly seen that the scores differ from each other even when the correlation 
coefficients are high and significant. It might be more accurate to raise Kline's (1986) 
cut-off coefficient for a meaningful correlation from .70 to at least .90 as 
evidence of more reliable ratings. A coefficient at this level might mean that the 
scores assigned are more similar and closer to each other. Deliberate training of 
raters, and their agreement on the criteria and performance indicators before any 
rating process for each student group, also seems strongly needed. To obtain verbal 
descriptions as concrete information, to recognize this process, and to establish the 
decision-making processes of raters, think-aloud protocols with follow-up interviews 
can also be employed. 
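
The comparison of correlation coefficients through Fisher's z transformation mentioned above can be illustrated with a minimal Python sketch. It assumes that the two coefficients come from independent samples and uses the standard large-sample test; the function names and the example values (r = .85 and r = .80 with n = 44 each) are hypothetical and are not results from this study. Because ratings of the same essays are dependent, a test for dependent correlations may be more appropriate in practice.

```python
import math
from typing import Tuple


def fisher_z(r: float) -> float:
    """Fisher's z transformation of a correlation coefficient r (|r| < 1)."""
    return math.atanh(r)


def compare_correlations(r1: float, n1: int, r2: float, n2: int) -> Tuple[float, float]:
    """Test whether two correlations differ (assumes independent samples).

    Returns the z statistic and the two-sided p-value.
    """
    # Standard error of the difference between the two transformed values
    se = math.sqrt(1.0 / (n1 - 3) + 1.0 / (n2 - 3))
    z = (fisher_z(r1) - fisher_z(r2)) / se
    # Two-sided p-value from the standard normal distribution
    p = math.erfc(abs(z) / math.sqrt(2.0))
    return z, p


# Hypothetical coefficients for illustration only (not taken from the study)
z_stat, p_value = compare_correlations(r1=0.85, n1=44, r2=0.80, n2=44)
print(f"z = {z_stat:.3f}, p = {p_value:.3f}")
```

A non-significant result in such a test would be in line with the statement that two sets of coefficients are similar, although it does not show that the underlying scores are identical.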
Appendix 1: 

ESSAY CRITERIA CHECKLIST (ECC) 

-Make a checkmark if the essay includes the following attributes- 
Appendix 2: 

ESSAY ASSESSMENT SCALE (ESAS) 

CRITERIA                            ATTRIBUTES                                          4   3   2   1   0

ORGANIZATION
  A.1. INTRODUCTION
    A.1.1. Introductory Sentences   Effective introductory sentences
    A.1.2. Thesis Statement         Appropriate thesis statement
                                    (thesis and central idea)
  A.2. BODY PARAGRAPHS
    A.2.1. Topic Sentence           Appropriate topic sentence (possibly implied)
                                    supporting the thesis and the central idea
    A.2.2. Supporting Sentences     Appropriate sentences supporting the topic
                                    (possibly major and minor)
  A.3. CONCLUSION                   Appropriate conclusion related to thesis

LANGUAGE USE
  B.1. Word Order                   Correct word order
  B.2. Pattern Variety              Using different patterns
  B.3. Verb Form                    Using verb forms correctly
  B.4. Tenses                       Using tenses appropriately
  B.5. Articles                     Using articles correctly
  B.6. Pronouns                     Using pronouns correctly
  B.7. Prepositions                 Using prepositions correctly
                                    (verb + preposition, adjective + preposition)

VOCABULARY
  C.1. Word Choice                  Selecting the appropriate words
  C.2. Word Variety                 Having a rich vocabulary
  C.3. Parts of speech              Using the correct parts of speech

MECHANICS
  D.1. Punctuation                  Using punctuation marks correctly
  D.2. Capitalization               Using cases (lower/upper) correctly
  D.3. Paragraphing                 Correct paragraph formatting
  D.4. Indentation                  Using margins correctly and consistently

IDEAS/CONTENT
  E.1. Title                        Appropriate title
  E.2. Development                  Appropriate development
  E.3. Unity                        Unity
  E.4. Transitional Signals         Using appropriate transitional signals

TOTAL SCORE 
Measuring Essay Assessment: 

Intra-rater and Inter-rater Reliability 

Citation: 

Kayapinar, U. (2014). Measuring essay assessment: Intra-rater and inter-rater 
reliability. Eurasian Journal of Educational Research, 57, 113-136. 
http://dx.doi.org/10.14689/ejer.2014.57.2 


Abstract 

Problem Statement: Considerable effort has been devoted to researching the effective 
scoring of writing ability, and many proposals have been put forward. In this 
context, rater reliability plays a very important role in making vital decisions at 
different turning points in individuals' educational and professional lives. The 
reliability of scores assigned by the same and by different raters using different 
scoring tools should also be discussed together with the scoring processes. 

Purpose of the Study: The purpose of the study is to reveal possible variation and 
consistency in the assessment of the writing ability of learners of English by the 
same/different raters using general impression marking (GIM), an essay criteria 
checklist (ECC), and an essay assessment scale (ESAS), and to discuss rater reliability. 

Method: Quantitative and qualitative data were used to allow interpretation and 
discussion of the consistency of the measurement results and the reliability of the 
scoring. The scoring tools were applied to 44 university students, and 10 raters 
scored these students' writing ability using general impression marking, the 
checklist, and the scale. 

Findings: The findings and the results of the analyses showed that, as expected, 
general impression marking is definitely not reliable. Considering the information 
obtained from the correlation coefficients, variance estimates, and generalizability 
coefficients, it is seen that the scores are not identical and that there is always 
diversity and variation among the results. 

Conclusions and Recommendations: When the total scores and the consistency among the 
scores given by the raters were examined, it was seen that the results were mostly 
not identical and differed from each other, even when the correlation coefficients 
were high and significant. Therefore, contrary to common belief, checklists and 
scales may not be as effectively reliable as expected unless the raters are well 
trained in these tools and reach an agreement on the criteria, criterion definitions, 
and performance indicators. In such measurements, results offered as evidence of 
reliable scoring may be more accurate if the correlation coefficient accepted as 
meaningful is at least .90. This would increase the closeness of the scores given to 
the same and to different written exams and could mean that more similar results 
emerge. Nevertheless, considering the present situation and results, it can be 
emphasized that using the scale is more reliable than the other scoring tools. Still, 
it is thought that replicating the study with more raters would contribute to the 
field. 

Keywords: Essay, assessment, inter-rater, rater, reliability 





